Abstract 

Background: Zipf's law and Heaps' law are observed in disparate complex systems. Of particular 
interest, these two laws often appear together. Many theoretical models and analyses have been 
proposed to understand their co-occurrence in real systems, but a clear picture of their relation 
is still lacking. 
Methodology/Principal Findings: We show that Heaps' law can be considered a derivative 
phenomenon if the system obeys Zipf's law. Furthermore, we refine the known approximate solution 
for the Heaps' exponent given the Zipf's exponent. We show that the approximate solution is in fact an 
asymptotic solution for infinite systems, while in finite-size systems the Heaps' exponent is sensitive to 
the system size. Extensive empirical analysis of tens of disparate systems demonstrates that our refined 
results better capture the relation between the Zipf's and Heaps' exponents. 

Conclusions/Significance: The present analysis provides a clear picture of the relation between 
Zipf's law and Heaps' law without the help of any specific stochastic model: Heaps' law 
is a derivative phenomenon of Zipf's law. The presented numerical method gives a considerably 
better estimate of the Heaps' exponent given the Zipf's exponent and the system size. Our analysis 
provides some insights into real complex systems; for example, one naturally obtains 
a better explanation of the accelerated growth of scale-free networks. 

Introduction 

Giant strides in complexity science have been the direct outcome of efforts to uncover the universal 
laws that govern disparate systems. Zipf's law [1] and Heaps' law [2] are two representative examples. In 
the 1940s, Zipf found a scaling law in the distribution of word frequencies. Ranking all the words 
in descending order of occurrence frequency and denoting by $z(r)$ the frequency of the word with rank $r$, 
Zipf's law reads $z(r) = z_{\max} \cdot r^{-\alpha}$, where $z_{\max}$ is the maximal frequency and $\alpha$ is the so-called Zipf's 
exponent. This power-law frequency-rank relation indicates a power-law probability distribution of the 
frequency itself, say $p(z) \sim z^{-\beta}$ with $\beta$ equal to $1 + 1/\alpha$ (see Materials and Methods). As a signature 
of complex systems, Zipf's law is observed everywhere [3]: examples include the distributions of firm 
sizes [4], wealth and incomes [5], paper citations [6], gene expressions [7], sizes of blackouts [8], family 
names [9], city sizes [10], personal donations [11], chess openings [12], traffic loads caused by YouTube 
videos [13], and so on. Accordingly, many mechanisms have been put forward to explain the emergence of 
Zipf's law [14][15], such as the rich-get-richer mechanism [16][17], self-organized criticality [18], Markov 
processes [19], aggregation of interacting individuals [20], optimization designs [21] and the least effort 
principle [22], to name just a few. 

Heaps' law [2] also characterizes natural language: according to it, 
the vocabulary size grows sublinearly with document size, say $N(t) \sim t^{\lambda}$ with $\lambda < 1$, where 
$t$ denotes the total number of words and $N(t)$ is the number of distinct words. One ingredient causing 
such sublinear growth may be the memory and bursty nature of human language [23-25]. A particularly 
interesting phenomenon is the coexistence of Zipf's law and Heaps' law. Gelbukh and Sidorov [26] 
observed these two laws in English, Russian and Spanish texts, with different exponents depending on 
the language. Similar results were recently reported for corpora of web texts [27], including the Industry 
Sector database, the Open Directory and the English Wikipedia. Besides the statistical regularities of 
text, the occurrences of tags for online resources [28][29], keywords for scientific publications [30], words 
contained in web pages returned by web searches [31], and identifiers in modern Java, C++ and C 
programs [32] also simultaneously display Zipf's law and Heaps' law. Benz et al. [33] reported 
Zipf's law in the distribution of the features of small organic molecules, together with Heaps' law 
for the number of unique features. In particular, Zipf's law and Heaps' law are closely related to 
evolving networks. It is well known that some networks grow in an accelerating manner [34][35] and 
have scale-free structures (see for example the WWW [36] and the Internet [37]). In fact, the former property 
corresponds to Heaps' law, in that the number of nodes grows sublinearly with the total degree 
of nodes, while the latter is equivalent to Zipf's law for the degree distribution. 

Baeza-Yates and Navarro [38] showed that the two laws are related: when $\alpha > 1$, it can be derived that 
if both Zipf's law and Heaps' law hold, $\lambda = 1/\alpha$. Using a more polished approach, Leijenhorst and 
Weide [39] generalized this result from Zipf's law to Mandelbrot's law [40], where $z(r) \sim (r_c + r)^{-\alpha}$ 
and $r_c$ is a constant. Based on a variant of the Simon model [16], Montemurro and Zanette [41][42] showed 
that Zipf's law results from Heaps' law, with $\alpha$ depending on $\lambda$ and the modeling parameter. 
Also based on a stochastic model, Serrano et al. [27] claimed that Zipf's law can result in Heaps' 
law when $\alpha > 1$, with Heaps' exponent $\lambda = 1/\alpha$. In this paper, we prove that for an evolving 
system with a stable Zipf's exponent, Heaps' law can be directly derived from Zipf's law without 
the help of any specific stochastic model. The relation $\lambda = 1/\alpha$ is only an asymptotic solution that holds for 
very large systems with $\alpha > 1$. We refine this result for finite-size systems with $\alpha > 1$ and 
complement it for $\alpha < 1$. In particular, we analyze the effects of system size on the Heaps' exponent, 
which have been completely ignored in the literature. Extensive empirical analysis of tens of disparate systems, 
ranging from keyword occurrences in scientific journals to the spreading pattern of the novel virus influenza A 
(H1N1), demonstrates that the refined results presented here better capture the relation between 
Zipf's and Heaps' exponents. In particular, our results agree well with the evolving regularities of 
accelerating networks and suggest that accelerating growth is necessary to keep a stable power-law 
degree distribution. Whereas the majority of studies on Heaps' law are limited to linguistics, our 
work opens the door to a much wider horizon that includes many complex systems. 



Results 

Analytical Results 

For simplicity of depiction, we use the language of word statistics in text, where $z(r)$ denotes the frequency 
of the word with rank $r$. However, the results are not limited to language systems. Note that $r - 1$ is 
exactly the number of distinct words with frequency larger than $z(r)$. Denoting by $t$ the total number of 
word occurrences (i.e., the size of the text) and by $N(t)$ the corresponding number of distinct words, we have 

$$ r - 1 = \int_{z(r)}^{z_{\max}} N(t)\, p(z')\, dz'. \qquad (1) $$

Note that $p(z) = A z^{-\beta}$ with $A$ a constant. According to the normalization condition $\int_{1}^{z_{\max}} p(z)\, dz = 1$, 
when $\beta > 1$ and $z_{\max} \gg 1$ (these two conditions hold for most real systems), $A = \frac{\beta - 1}{1 - z_{\max}^{1-\beta}} \approx \beta - 1$. 
Substituting $p(z')$ in Eq. 1 by $(\beta - 1) z'^{-\beta}$, we have 

$$ r - 1 = N(t) \left[ z(r)^{1-\beta} - z_{\max}^{1-\beta} \right]. \qquad (2) $$
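According to Zipf's law $z(r) = z_{\max} \cdot r^{-\alpha}$ and the relation between the Zipf's and power-law exponents, 
$\beta = 1 + 1/\alpha$, the right-hand side of Eq. 2 can be expressed in terms of $z_{\max}$ and $\alpha$ as 

$$ z(r)^{1-\beta} - z_{\max}^{1-\beta} = z_{\max}^{-1/\alpha} (r - 1). \qquad (3) $$

Combining Eq. 2 and Eq. 3, we obtain the estimate of $z_{\max}$ as 

$$ z_{\max} \approx N(t)^{\alpha}. \qquad (4) $$

Obviously, the text size $t$ is the sum of all words' occurrences, say 

$$ t = \sum_{r=1}^{N(t)} z(r) \approx \int_{1}^{N(t)} z_{\max}\, r^{-\alpha}\, dr = \frac{z_{\max} \left( N(t)^{1-\alpha} - 1 \right)}{1 - \alpha}. \qquad (5) $$

Substituting $z_{\max}$ by Eq. 4, we arrive at the relation between $N(t)$ and $t$: 

$$ \frac{N(t)^{\alpha} \left( N(t)^{1-\alpha} - 1 \right)}{1 - \alpha} = t. \qquad (6) $$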
The direct comparison between the empirical observations and Eq. 6, as well as an improved version of 
Eq. 6, is shown in Materials and Methods. Clearly, Eq. 6 is not a simple power-law form as described 
by Heaps' law. We will see that Heaps' law is an approximate result that can be derived from 
Eq. 6. Actually, when $\alpha$ is considerably larger than 1, $N(t)^{1-\alpha} \ll 1$ and $N(t) \approx (\alpha - 1)^{1/\alpha} t^{1/\alpha}$; while 
if $\alpha$ is considerably smaller than 1, $N(t)^{1-\alpha} \gg 1$ and $N(t) \approx (1 - \alpha) t$. This approximate result can be 
summarized as 

$$ \lambda = \begin{cases} 1/\alpha, & \alpha > 1, \\ 1, & \alpha < 1, \end{cases} \qquad (7) $$

which is in accordance with the previous analytical results [27][38][39] for $\alpha > 1$ and complements 
them with the case $\alpha < 1$. 
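
Eq. 6 has no closed-form inverse for general $\alpha$, so $N(t)$ has to be obtained numerically. The following is a minimal sketch of one way to do this, not the authors' implementation: it inverts Eq. 6 by bisection and estimates the Heaps' exponent as the least-squares slope of $\ln N$ against $\ln t$. The function names, the log-spaced sampling of $t$, and the lower cutoff `t_min` are assumptions made for illustration, so the fitted values need not coincide exactly with those reported in the figures.

```python
import math

def text_size(n, alpha):
    """Left-hand side of Eq. 6: the text size t implied by n distinct words."""
    if abs(alpha - 1.0) < 1e-9:
        return n * math.log(n)  # limit alpha -> 1 of Eq. 6 (N ln N = t)
    return n ** alpha * (n ** (1.0 - alpha) - 1.0) / (1.0 - alpha)

def distinct_words(t, alpha):
    """Invert Eq. 6 for N(t) by bisection; text_size is increasing for n >= 1."""
    lo, hi = 1.0, 2.0
    while text_size(hi, alpha) < t:  # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if text_size(mid, alpha) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def heaps_exponent(t_max, alpha, t_min=10.0, points=50):
    """Least-squares slope of ln N(t) vs ln t on log-spaced t in [t_min, t_max].

    The sampling scheme and t_min are illustrative assumptions.
    """
    xs, ys = [], []
    for i in range(points):
        t = t_min * (t_max / t_min) ** (i / (points - 1))
        xs.append(math.log(t))
        ys.append(math.log(distinct_words(t, alpha)))
    mean_x = sum(xs) / points
    mean_y = sum(ys) / points
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

For example, `distinct_words(1e12, 2.0)` is close to $(\alpha - 1)^{1/\alpha} t^{1/\alpha} = 10^6$ and `distinct_words(1e6, 0.5)` is close to $(1 - \alpha) t = 5 \times 10^5$, which are the two limiting forms quoted above.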

Although Eq. 6 is different from a strict power law, numerical results indicate that the relationship 
between $N(t)$ and $t$ can be well fitted by power-law functions (the fitting is usually much better 
than for the empirical observations of Heaps' law; see Materials and Methods for some typical 
examples). In Fig. 1, we report the numerical results with a fixed total number of word occurrences 
$t = 10^5$. When $\alpha$ is considerably larger or smaller than 1, the numerical results agree well with the 
known analytical solution in Eq. 7; however, a clear deviation is observed for $\alpha \approx 1$ (see Materials and 
Methods for how to obtain the numerical results for $\alpha = 1$). 

To validate the numerical results of Eq. 6, we propose a stochastic model. Given the total number of 
word occurrences $t$, clearly, there are at most $t$ distinct words having the chance to appear. The initial 
occurrence number of each of these $t$ words is set to zero. At each time step, these $t$ words are sorted in 
descending order of their occurrence number (words with the same number of occurrences are randomly 
ordered), and the probability that a word with rank $r$ will occur in this time step is proportional to $r^{-\alpha}$. The 
whole process stops after $t$ time steps. The distribution of word occurrences always obeys Zipf's law 
with a stable exponent $\alpha$, and the growth of $N(t)$ approximately follows Heaps' law with $\lambda$ dependent 
on $\alpha$. The simulation results for $\lambda$ vs. $\alpha$ of this model are also reported in Fig. 1, and they agree perfectly 
with the numerical ones from Eq. 6. The result of the stochastic model strongly supports the validity of 
Eq. 6, and thus we only discuss the numerical results of Eq. 6. 
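
The stochastic model described above can be sketched as follows. This is a direct, unoptimized transcription of the stated rules (re-sort at every step, break ties at random, pick a rank with probability proportional to $r^{-\alpha}$); the function name, the `seed` parameter and the returned data layout are assumptions introduced for illustration. Because every step re-sorts all $t$ candidate words, the sketch is only practical for small $t$.

```python
import random

def simulate_zipf_model(t, alpha, seed=0):
    """Run the rank-based model for t steps.

    Returns (counts, growth): counts[w] is the number of occurrences of
    candidate word w, and growth[s] is the number of distinct words that
    have appeared after s + 1 steps.
    """
    rng = random.Random(seed)
    counts = [0] * t
    # Unnormalized probability of choosing the word currently at rank r.
    weights = [r ** (-alpha) for r in range(1, t + 1)]
    distinct = 0
    growth = []
    for _ in range(t):
        # Sort words by descending count; equal counts are ordered at random.
        order = sorted(range(t), key=lambda w: (-counts[w], rng.random()))
        rank_index = rng.choices(range(t), weights=weights)[0]
        word = order[rank_index]
        if counts[word] == 0:
            distinct += 1  # a word appears for the first time
        counts[word] += 1
        growth.append(distinct)
    return counts, growth
```

A call such as `simulate_zipf_model(500, 1.0)` returns the growth curve $N(t)$ of a single realization, whose log-log slope can then be compared with the numerical solution of Eq. 6.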

In addition to $\alpha$, the Heaps' exponent $\lambda$ also depends on the system size, namely the total number 
of word occurrences, $t$. An example for $\alpha = 1$ is shown in Fig. 2, and how $\lambda$ varies in the $(\alpha, t)$ plane is 
shown in Fig. 3. It is seen that the exponent $\lambda$ increases monotonically with $t$. According 
to Eq. 6, in the limit of large system size, $t \to \infty$, the exponent $\lambda$ is determined by 
the asymptotic solution Eq. 7. Actually, the asymptotic solution describes well the systems with $\alpha \gg 1$ 
or $\alpha \ll 1$ or $t \to \infty$. However, real systems often have $\alpha$ around 1 and are of finite size. As indicated by 
Fig. 2 and Fig. 3, the growth of $\lambda$ with $t$ is really slow. For example, when $\alpha = 1$, for most real systems 
with $t$ ranging from $10^4$ to $10^8$, the exponent $\lambda$ is considerably smaller than the asymptotic solution $\lambda = 1$. 
Even for very large $t$, probably larger than in any studied real system, such as $t = 10^{16}$, the difference 
between the numerical result and the asymptotic solution can still be observed. As shown in the next section, 
empirical observations differ from the asymptotic solution, and 
the simple numerical method based on Eq. 6 provides a more accurate estimate. 

Experimental Results 

We analyze a number of real systems ranging from a small-scale system containing only 40 distinct elements 
to large-scale systems consisting of more than $10^5$ distinct elements. The results are listed in Table 1, while 
the detailed data description is provided in Materials and Methods. Four classes of real systems are 
considered, including the occurrences of words in different books and different languages (data sets Nos. 
1-9), the occurrences of keywords in different journals (data sets Nos. 10-33), the confirmed cases of the 
novel virus influenza A (data set No. 34), and the citation record of PNAS articles (data set No. 35). 
Figure 4 reports Zipf's law and Heaps' law for four typical examples, one from each 
class. 
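
For readers who wish to repeat this kind of measurement on their own data, the empirical Heaps' curve and its exponent can be obtained along the following lines. This is a hedged sketch rather than the procedure used for Table 1: the function names are hypothetical, and the unweighted least-squares fit over all points is an assumption (for some data sets the fits reported here cover only part of the range).

```python
import math

def heaps_curve(tokens):
    """Return N(t) for t = 1..len(tokens): the number of distinct items seen so far."""
    seen = set()
    curve = []
    for token in tokens:
        seen.add(token)
        curve.append(len(seen))
    return curve

def fit_heaps_exponent(curve):
    """Least-squares slope of ln N(t) against ln t (needs at least two points)."""
    xs = [math.log(t) for t in range(1, len(curve) + 1)]
    ys = [math.log(n) for n in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

Here `tokens` can be any ordered sequence of hashable items, such as the words of a book in reading order or the keywords of a journal ordered by publication time.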

To sum up, the empirical results indicate that (i) evolving systems displaying Zipf's law also obey 
Heaps' law, even for small-scale systems; (ii) the asymptotic solution (Eq. 7) captures the 
relationship between the Zipf's exponent and Heaps' exponent well, and the present numerical result based 
on Eq. 6 provides considerably better estimates (the numerical results based on Eq. 6 outperform 
Eq. 7 in 34 of the 35 tested data sets). 

Discussion 

Zipf's law and Heaps' law are well known in the context of complex systems. They were discovered 
independently and treated as two independent statistical laws for decades. Recently, the increasing 
evidence of the coexistence of these two laws has led to serious consideration of their relation. However, a 
clear picture cannot be extracted from the literature. For example, Montemurro and Zanette [41][42] 
suggested that Zipf's law results from Heaps' law, while Serrano et al. [27] claimed that 
Zipf's law can result in Heaps' law. In addition, many previous analyses of their relation are 
based on stochastic models, and the results depend strongly on the corresponding models; 
we are thus less confident of their applicability in explaining the coexistence of the two laws observed 
almost everywhere. 

In this article, without the help of any specific stochastic model, we directly show that Heaps' 
law can be considered a derivative phenomenon given that the evolving system obeys Zipf's law 
with a stable exponent. In contrast, Zipf's law cannot be derived from Heaps' law without the 
help of a specific model or some external conditions. However, one cannot conclude that Zipf's law 
is more fundamental, since there may exist mechanisms that result only in Heaps' law; that is, 
a system may display Heaps' law while not obeying Zipf's law. In addition, we 
refine the known asymptotic solution (Eq. 7) by a more complex formula (Eq. 6), which is considerably 
more accurate than the asymptotic solution, as demonstrated by both the stochastic test model and 
the extensive empirical analysis. In particular, our investigation of the effect of system size fills a 
gap in the relevant theoretical analyses. 

Our analytical result (Eq. 6) indicates that the growth of the vocabulary of an evolving system cannot be 
exactly described by Heaps' law even if the system obeys a perfect Zipf's law with a constant 
exponent. In fact, not only the solution of the Heaps' exponent (Eq. 7), but also Heaps' law itself 
is an asymptotic approximation obtained by considering infinite-size systems. Worse, a Zipf's 
exponent larger than one does not correspond to a true distribution $p(z)$, since $\langle z \rangle$ diverges as 
the system size increases, yet a large fraction of real systems can be well characterized by Zipf's 
law with $\alpha > 1$ (see general examples in Refs. [3][15] and examples of degree distributions of complex 
networks in Refs. [46][47]). Setting aside this blemish in mathematical strictness, Zipf's law and 
Heaps' law capture well the macroscopic statistics of many complex systems, and our analysis provides a 
clear picture of their relation. 

Note that our analysis depends on the ideal assumption of a "perfect" power law (Zipf's law) for the 
frequency distribution, while a real system never displays such a perfect law. Indeed, deviations from a 
power law have been observed, but the assumption of a perfect power-law distribution is widely used in 
many theoretical analyses. For example, the degree distribution in email networks [48] has a cutoff at 
about $z = 100$ and the one in sexual contact networks [49] displays a drooping head, while in the analysis 
of epidemic dynamics, the underlying networks are usually assumed to be purely scale-free networks [50]. 
Another example is the study of the effects of human dynamics on epidemic spreading [51][52], where 
the interevent time distribution of human actions is assumed to be a power law, ignoring the 
observed cutoffs and periodic oscillations [53][54]. In short, although the ideal assumption of a perfect 
power-law distribution cannot fully reflect reality, the corresponding analysis contributes 
much to our understanding of many phenomena. 

An interesting implication of our results concerns the accelerated growth of scale-free networks. Considering 
the degree of a node as its occurrence frequency and the total degree of all nodes as the text size, 
a growing network is analogous to a language system. Then, the scale-free nature corresponds to 
Zipf's law of word frequency and the accelerated growth corresponds to Heaps' law of vocabulary 
growth. In an accelerated growing network, the total degree $t$ (proportional to the number of edges) scales 
in a power-law form as $t \sim N(t)^{\phi}$, where $N(t)$ denotes the number of nodes and $\phi > 1$ is the accelerating 
exponent. At the same time, the degree distribution usually follows a power law $p(k) \sim k^{-\beta}$, where 
$k$ denotes the node degree. For example, the Internet at the autonomous system (AS) level displays the 
scale-free nature with $\beta \approx 2.16$ (see Table 1 in Ref. [55]) and thus $\alpha = \frac{1}{\beta - 1} \approx 0.862$. According to a recent 
report [37] on empirical analysis of the Internet at the AS level, up to December 2006 the total degree was 
$t = 105652$. The corresponding numerical result of the Heaps' exponent is $\lambda \approx 0.92$, and thus the accelerating 
exponent can be estimated as $\phi = 1/\lambda \approx 1.09$. In contrast, the asymptotic solution Eq. 7 suggests 
steady growth with $\phi = \lambda = 1$. Compared with the empirical result $\phi \approx 1.11$ [37], Eq. 6 ($\phi = 1.09$) 
gives a better result than Eq. 7 ($\phi = 1$). Actually, the asymptotic solution indicates that all scale-free 
networks with $\beta > 2$ should grow in a steady (linear) manner, which contradicts many known empirical 
observations [34-37], while the refined result in this article is in accordance with them. Furthermore, 
our result provides some insights into the growth of complex networks: accelerated growth can 
be expected if the network is scale-free with a stable exponent, and this phenomenon is prominent when 
$\beta$ is around 2. 
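
The arithmetic of the Internet example can be retraced step by step. The block below uses only the values quoted in the text and the fact that $N \sim t^{\lambda}$ is equivalent to $t \sim N^{1/\lambda}$, so that the accelerating exponent is the reciprocal of the Heaps' exponent.

```latex
\begin{align*}
\alpha &= \frac{1}{\beta - 1} = \frac{1}{2.16 - 1} \approx 0.862, \\
\phi_{\text{Eq. 6}} &= \frac{1}{\lambda} \approx \frac{1}{0.92} \approx 1.09
  \quad (\text{numerical } \lambda \text{ at } t = 105652), \\
\phi_{\text{Eq. 7}} &= \frac{1}{\lambda} = 1
  \quad (\text{since } \alpha < 1 \text{ gives } \lambda = 1).
\end{align*}
```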



Materials and Methods 

0.1 Relation between Zipf's Law and Power Law 

Given Zipf's law $z(r) \sim r^{-\alpha}$, we here prove that the probability density function $p(z)$ obeys a power 
law $p(z) \sim z^{-\beta}$ with $\beta = 1 + 1/\alpha$. Consider the data points with ranks between $r$ and $r + \delta r$, where $\delta r$ 
is a very small value. Clearly, the number of data points is $\delta r$, which can be expressed by the probability 
density function as 

$$ \delta r = p(z(r))\, \delta z, \qquad (8) $$

where 

$$ \delta z \sim r^{-\alpha} - (r + \delta r)^{-\alpha} \sim r^{-\alpha-1}\, \delta r. \qquad (9) $$

Therefore, we have 

$$ p(r^{-\alpha}) \sim r^{\alpha+1} \sim (r^{-\alpha})^{-\frac{\alpha+1}{\alpha}}, \qquad (10) $$

namely $\beta = 1 + 1/\alpha$. Analogously, Zipf's law $z(r) \sim r^{-\alpha}$ can be derived from the power-law probability 
density distribution $p(z) \sim z^{-\beta}$, with $\alpha = \frac{1}{\beta - 1}$. 

0.2 Direct Comparison between Empirical and Analytical Results 

Given the parameter $\alpha$, according to Eq. 6, we can numerically obtain the function $N(t)$. The comparison 
between Eq. 6 and the empirical data for words in the book "La Divina Commedia" and keywords in 
PNAS articles is shown in Fig. 5. The growing tendency of distinct words can be well captured by Eq. 
6. Actually, using a more accurate normalization condition J*™™ + 2 p(z)dz — 1, as an improved version of 
Eq. 4, the estimate of $z_{\max}$ is determined by 




Given the parameter $\alpha$, for an arbitrary $N(t)$, one can estimate the corresponding $z_{\max}$ according to 
Eq. 11 and then determine the value of $t$ by Eq. 5. The numerical results of this improved version 
are also presented in Fig. 5, and they fit the empirical data better than Eq. 6. Notice that both 
analytical results give almost the same slope in the log-log plot of the function $N(t)$, namely the Heaps' 
exponents obtained by these two versions are almost the same. 



0.3 Examples of Numerical Results 

Mathematically speaking, as indicated by Eq. 6, $N(t)$ does not scale as a power law with $t$. However, 
the numerical results suggest that the dependence of $N(t)$ on $t$ can be well approximated by power-law 
functions. As shown in Fig. 6, for a wide range of $\alpha$, $N(t)$ can be well fitted by $t^{\lambda}$, and the value of 
the fitting exponent $\lambda$ depends on both $\alpha$ and $t$. 



0.4 The case of $\alpha = 1$ 

The numerical solution of Eq. 6 for $\alpha = 1$ can be obtained by considering the limit $\alpha \to 1$, where 
$N(t)^{\alpha} \approx N(t)$ and $N(t)^{1-\alpha} \approx 1 + (1 - \alpha) \ln N(t)$. Accordingly, Eq. 6 can be rewritten as 

$$ N(t) \ln N(t) = t. \qquad (12) $$

When $t$ approaches infinity, $N(t)$ scales almost linearly with $t$, since $\ln t = \ln N(t) + \ln \ln N(t)$ and 
$\lim_{N \to \infty} \ln \ln N / \ln N = 0$. Actually, the 
solution can be expressed as $N(t) = t / W(t)$, where $W(t)$ is the well-known Lambert W function [56] that 
satisfies 

$$ W(t)\, e^{W(t)} = t. \qquad (13) $$

For any finite system, the numerical result can be produced by Eq. 12. 
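
As an illustration of the case $\alpha = 1$, the sketch below evaluates $N(t) = t / W(t)$ with a small Newton iteration for the Lambert W function, using only the standard library. It also evaluates the local log-log slope $d \ln N / d \ln t$, which follows from Eq. 12 as $\ln N / (1 + \ln N) = W(t) / (1 + W(t))$. The function names are hypothetical, and the local slope is a pointwise quantity used here only to illustrate how slowly the exponent approaches 1; it is not the least-squares fitting exponent reported in the figures and in Table 1.

```python
import math

def lambert_w(t):
    """Principal branch W(t) for t > 0, by Newton's method on w * exp(w) = t."""
    w = math.log(1.0 + t)  # lies at or above the root, so Newton steps descend
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - t) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= 1e-14 * (1.0 + abs(w)):
            break
    return w

def distinct_words_alpha_one(t):
    """Solve N ln N = t (Eq. 12) through N = t / W(t)."""
    return t / lambert_w(t)

def local_slope_alpha_one(t):
    """Pointwise slope d ln N / d ln t on the curve N ln N = t."""
    w = lambert_w(t)  # equals ln N(t)
    return w / (1.0 + w)
```

Evaluating `local_slope_alpha_one` at $t = 10^4$, $10^8$ and $10^{16}$ gives values that increase with $t$ but remain below 1, in line with the slow convergence to the asymptotic solution discussed in the Results.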



0.5 Data description 

The data sets analyzed in this article can be divided into four classes. According to the data sets shown 
in Table 1, these four classes are as follows. 

(i) Occurrences of words in different books and different languages (data sets Nos. 1-9). Data 
set No. 1 is the English book Moby Dick, written by Herman Melville; data sets No. 2 (De Bello 
Gallico), No. 3 (Philosophiae Naturalis Principia Mathematica) and No. 7 (Aeneis) are Latin books 
written by Gaius Julius Caesar, Isaac Newton and Virgil, respectively; data sets No. 4 (Don Quijote), 
No. 5 (La Celestina) and No. 8 (Cien años de soledad) are Spanish novels written by Miguel de Cervantes, 
Fernando de Rojas and Gabriel García Márquez, respectively; data set No. 6 (Faust) is a German 
opera written by Johann Wolfgang von Goethe; data set No. 9 (La Divina Commedia di Dante) is 
the Italian epic poem written by Dante Alighieri. All the above data were collected by Carpena et al. [44] 
and are available at http://bioinfo2.ugr.es/TextKeywords/index.html 

(ii) Occurrences of keywords in different journals (data sets Nos. 10-33). These 24 journals, from No. 
10 to No. 33, are PNAS, Chin. Sci. Bull., J. Am. Chem. Soc., Acta Chim. Sinica, Crit. Rev. Biochem. 
Mol. Biol., J. Biochem., J. Nutr. Biochem., Phys. Rev. Lett., Appl. Phys. Lett., Physica A, ACM 
Comput. Surv., ACM Trans. Graph., Comput. Netw., ACM Trans. Comput. Syst., Econometrica, J. 
Econ. Theo., SIAM Rev., SIAM J. Appl. Math., Invent. Math., Ann. Neurol., J. Evol. Biol., Theo. 
Popul. Biol., MIS Quart., and IEEE Trans. Automat. Contr. These data were collected from the ISI 
Web of Knowledge (http://isiknowledge.com/). For every scientific journal, we consider the keyword 
sequence of each article, ordered by publishing time. Since most of the articles published before 1990 
do not have keywords in the ISI database, we limit our collections to 1991 through 2007 (except for ACM 
Comput. Surv., which is available only from 1994 to 1999). 

(iii) Confirmed cases of the novel virus influenza A (data set No. 34). The data on the cumulative 
number of laboratory-confirmed cases of H1N1 in each country are available from the website of Epidemic 
and Pandemic Alert of the World Health Organization (WHO) (http://www.who.int/). The analyzed data 
set covers reports of influenza A from April 26 to May 18, updated every one or two days. After May 18, 
the distribution of confirmed cases in each country shifted from a power law to a power-law form with 
exponential cutoff [45]. 

(iv) Citation record of PNAS articles (data set No. 35). This data set consists of all the citations to 
PNAS articles from papers published between 1915 and 2009 according to the ISI database, ordered by 
time. 
Figure 1. Relationship between the Heaps' exponent $\lambda$ and Zipf's exponent $\alpha$. The solid 
curve represents the asymptotic solution shown in Eq. 7, the dashed curve is the numerical result based 
on Eq. 6, and the circles denote the result from the stochastic model. For the numerical result and the 
result of the stochastic model, the total number of word occurrences is fixed at $t = 10^5$. 




Figure 3. Heaps' exponent $\lambda$ as a function of $(\alpha, t)$. 
Figure 4. Zipf's law and Heaps' law in four example systems. (a) Words in Dante Alighieri's 
great book "La Divina Commedia" in Italian [44], where $Z(r)$ is the frequency of the word ranked $r$ and 
$N(t)$ is the number of distinct words; (b) Keywords of articles published in the Proceedings of the 
National Academy of Sciences of the United States of America (PNAS) [30], where $Z(r)$ is the frequency 
of the keyword ranked $r$ and $N(t)$ is the number of distinct keywords; (c) Confirmed cases of the novel 
virus influenza A (H1N1) [45], where $Z(r)$ is the number of confirmed cases of the country ranked $r$ and 
$N(t)$ is the number of infected countries in the presence of $t$ confirmed cases over the world; (d) PNAS 
articles having been cited at least once from 1915 to 2009, where $Z(r)$ is the number of citations of the 
article ranked $r$ and $N(t)$ is the number of distinct articles in the presence of $t$ citations to PNAS. In 
(c), the data set is small and thus only two significant digits are given. The fittings in (c1) and (c2) 
only cover the area marked in blue. In (d1), deviations from a power law are observed in the head and 
tail, and thus the fitting only covers the blue area. The Zipf's (power-law) exponents and Heaps' 
exponents are obtained by using the maximum likelihood estimation [3][43] and least square method, 
respectively. 
Figure 5. Direct comparison between the empirical data and Eq. 6 as well as its improved 
version. The left and right plots are for the words in "La Divina Commedia" and the keywords in 
PNAS. The blue dashed lines and red solid lines present the results of Eq. 6 and Eq. 11, respectively. In 
accordance with Figure 4 and Table 1, the values of the parameter $\alpha$ are given as 1.117 and 0.893, 
respectively. 




Figure 6. $N(t)$ vs. $t$ according to the numerical results of Eq. 6. The red, black and blue lines 
correspond to the cases $\alpha = 0.5$, $\alpha = 1.0$ and $\alpha = 1.5$. The system sizes (i.e., the total number of 
word occurrences), from left to right, are $t = 10^5$, $t = 10^7$ and $t = 10^9$. The fitting exponent $\lambda$ is obtained by 
the least square method. The fitting lines and numerical results almost completely overlap. 
Table 1. Empirical statistics and analysis results of real data sets. $T$ is the total number of 
elements, $N(T)$ is the total number of distinct elements, $\alpha$ is the Zipf's exponent obtained by 
maximum likelihood estimation [3][43], $\lambda_a$ is the asymptotic solution of the Heaps' exponent as shown in 
Eq. 7, $\lambda_n$ is the numerical value of the Heaps' exponent given $T$ and $\alpha$ as shown in Fig. 3, and $\lambda_e$ is the 
empirical result of the Heaps' exponent obtained by the least square method. Only two significant digits are 
given for the 34th data set since this data set is very small. Except for the 4th data set, 
in all other 34 real data sets, the numerical results based on Eq. 6 outperform the asymptotic solution 
shown in Eq. 7. A detailed description of these data sets can be found in Materials and Methods. 



No.   T        N(T)     $\alpha$   $\lambda_a$   $\lambda_n$   $\lambda_e$
1     206779   18217    1.323      0.756         0.725         0.738
2     20516    5671     0.969      1             0.858         0.859
3     109854   13906    1.063      0.941         0.845         0.817
4     449205   20220    1.464      0.683         0.667         0.679
5     68458    9191     1.095      0.913         0.823         0.810
6     81037    13254    1.025      0.976         0.859         0.832
7     63742    16622    1.057      0.946         0.840         0.852
8     138985   15550    1.188      0.842         0.787         0.765
9     101940   12667    1.117      0.895         0.818         0.799
10    504610   116800   0.893      1             0.936         0.863
11    53214    34194    0.540      1             0.983         0.946
12    310853   69185    0.939      1             0.913         0.871
13    30852    17562    0.595      1             0.972         0.939
14    2761     2328     0.397      1             0.964         0.978
15    58300    22599    0.786      1             0.941         0.914
16    20660    8155     0.790      1             0.921         0.890
17    226090   69251    0.692      1             0.977         0.894
18    176291   62567    0.572      1             0.989         0.920
19    44735    19933    0.685      1             0.961         0.915
20    1924     1323     0.463      1             0.946         0.939
21    5093     2985     0.593      1             0.941         0.920
22    3490     2442     0.500      1             0.952         0.950
23    1403     787      0.524      1             0.926         0.931
24    7469     4142     0.654      1             0.936         0.925
25    7710     3857     0.658      1             0.935         0.930
26    3232     2658     0.416      1             0.964         0.976
27    13165    7743     0.612      1             0.959         0.936
28    3749     2353     0.568      1             0.943         0.940
29    30092    11002    0.815      1             0.924         0.891
30    21894    8666     0.776      1             0.930         0.900
31    7627     3841     0.685      1             0.933         0.930
32    4185     2242     0.675      1             0.921         0.929
33    23822    10753    0.648      1             0.959         0.917
34    8829     40       3.0        0.33          0.34          0.35
35    237982   56961    0.462      1             0.993         0.929



